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Abstract: Wc investigate in this paper the estimation of Gaussian graphs 
by model selection from a non-asymptotic point of view. We start from an 
n-sample of a Gaussian law Pc in R*' and focus on the disadvantageous case 
where n is smaller than p. To estimate the graph of conditional dependences 
of Pc , we introduce a collection of candidate graphs and then select one of 
them by minimizing a penalized empirical risk. Our main result assesses the 
performance of the procedure in a non-asymptotic setting. We pay special 
attention to the maximal degree D of the graphs that we can handle, which 
turns to be roughly n/(21ogp). 
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1. Introduction 

Let us consider a Gaussian law Pc in with mean and positive definite 
covariance matrix C. We write 9 for the matrix of the regression coefficients 
associated to the law Pc, more precisely 6 = [^l"''] ; -^i is the p x p matrix 

(1) 

such that Oy = for j = 1, . . . ,p and 

E[X(J") I X^'^), k + j] = ^^(^^X^'^), J e {1, . . . ,p} , a.s. 

for any random vector X = (X*^^^ , ■ • ■ , X'^P'')^ of law Pc- Our aim is to estimate 
the matrix B by model selection from an n-sample Xi, . . . , X„ i.i.d. with law Pc- 
We will focus on the disadvantageous case where the sample size n is smaller 
than the dimension p. 

We call henceforth shape of 9, the set of the couples of integers such 
that 9l^^ ^ 0. The shape of 9 is usually represented by a graph g with p labeled 
vertices {1, . . . , p}, by setting an edge between the vertices i and j when 6*^^ ^ 0. 
This graph is well-defined since 6*^' = if and only if 9^^ = 0; the latter 
property may be seen e.g. on the formula 9^''^ = ^'^^ * J- 
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The graph g is of mtcrcst for the statistician since it depicts the conditional 
dependences of the variables X'^^^s. Actually, there is an edge between i and j if 
and only if X^'^ is not independent of X'^^^ conditionally on the other variables. 
The objective in Gaussian graphs estimation is usually to detect the graph g. 
Even if the purpose of our procedure is to estimate 9 and not g, we propose 
to simultaneously estimate g as follows. We associate with our estimator 9 of 
0, the graph g where we set an edge between the vertices i and j when is 
non-zero. 

Estimation of Gaussian graphs with n <C p is a current active field of research 
motivated by applications in postgenomic. Biotechnological developments (mi- 
croarrays, 2D-electrophoresis, etc) enable to produce a huge amount of pro- 
teomic and transcriptomic data. One of the challenge in postgenomic is to infer 
from these data the regulation network of a family of genes (or proteins). The 
task is challenging for the statistician due to the very high-dimensional nature 
of the data and the small sample size. For example, microarrays measure the 
expression levels of a few thousand genes (typically 4000) and the sample size 
n is no more than a few tens. The Gaussian graphical modeling appears to be a 
valuable tool for this issue, sec the papers of Kishino and Waddell [14], Dobra et 
al ['■)], Wu and Ye [21)]. The gene expression levels in the microarray arc modeled 
by a Gaussian law Vq and the regulation network of the genes is then depicted 
by the graph g of the conditional dependences. 

Various procedures have been proposed to perform graph estimation when 
p > n. Many are based on multiple testing, see for instance the papers of Schafer 
and Strimmer [IG], Drton and Perlman [8; 10] or Wille and Biihlmann [19]. We 
also mention the work of Vcrzelen and Villers [17] for testing in a non-asymptotic 
framework whether there are (or not) missing edges in a given graph. Recently, 
several authors advocate to take advantage of the nice computational proper- 
ties of the Z ^-penalization to cither estimate the graph g or the concentration 
matrix C~^. Meinshausen and Biihlmann [15] propose to learn the graph g by 
regressing with the Lasso each variable against the others. Huang et al. [1-3] or 
Yuan and Lin [21] (see also Banerjce et al. [1] and Friedman et al. [11]) suggest 
in turn to rather estimate by minimizing the log-likelihood for the concen- 
tration matrix penalized by the ^^-norm. The performance of these algorithms 
are mostly unknown: the few theoretical results are only valid under restrictive 
conditions on the covariance matrix and for large n (asymptotic setting). In 
addition to these few theoretical results, Villers et al. [18] propose a numerical 
investigation of the validity domain of some of the above mentioned procedures. 

Our aim in this work is to investigate Gaussian graph estimation by model se- 
lection from a non-asymptotic point of view. We propose a procedure to estimate 
9 and assess its performance in a non-asymptotic setting. Then, we discuss on 
the maximum degree of the graphs that we can accurately estimate and explore 
the performance of our estimation procedure in a small numerical study. 

We will use the Mean Square Error of Prediction (MSEP) as a criterion to 
assess the quality of our procedure. To define this quantity, we introduce a few 
notations. For any k,q gN, we write || • \\kxq for the Frobenius norm in R'^^', 
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namely \\A\\l^^ = Trace {A^A), for any A e M'''^«. The MSEP of the estimator 
9 is then 

MSEP(^) = E \\\C^'\e - J = E [||X^,„(0 - 

where C^^^ is the positive square root of C and X„eiu is a random vector, 
independent of 9, with distribution Pc. We underUne that the MSEP focus on 
the quahty of the estimation of 9 and not of g. In particular, we do not aim to 
estimate at best the "true" graph g, but rather to estimate at best the regression 
matrix 9. We choose this point of view for two reasons. First, we do not believe 
that the matrix 9 is exactly sparse in practice, in the sense that 9\^^ ~ for 
most of the i, j S {1, . . . Rather, we want to handle cases where the matrix 
9 is only approximately sparse, which means that there exists a sparse matrix 
9* which is a good approximation of 9. In this case, the shape g of may not 
be sparse at all, it can even be the complete graph. Our goal is then not to 
estimate g but rather to capture the main conditional dependences given by 
the shape g* of 9*. The second reason for considering the MSEP as a quality 
criterion for our procedure is that we want to quantify the fact that we do not 
want to miss the important conditional dependences, but we do not worry too 
much missing a weak one. In other words, even in the case where the shape g of 
9 is sparse, we are interested in finding the main edges of g (corresponding to 
strong conditional dependences) and we do not really care of missing a "weak" 
edge which is overwhelmed by the noise. The MSEP is a possible way to take 
this issue into account. 

To estimate 9, we will first introduce a collection A4 of graphs, which are our 
candidates for describing the shape g of 9. If we have no prior information on 
g, a possible choice for A4 is the set of all graphs with degree^ less than some 
fixed integer D. Then, we associate with each graph to G Al, an estimator 9m of 
9 by minimizing an empirical version of the MSEP with the constraint that the 
shape of 6',„ is given by m, see Section 2 for the details. Finally, we select one 
of the candidate graph to by minimizing a penalized empirical MSEP and set 
9 = 9rh- Our main result roughly states that when the candidate graphs have a 
degree smaller than n/(21ogp), the MSEP of 9 nearly achieves, up to a log(p) 
factor, the minimal MSEP of the collection of estimators {9m, m G A4}. 

It is of practical interest to know if the condition on the degree of the can- 
didate graphs can be avoided. This point is discussed in Section 3.1, where we 
emphasize that it is hopeless to try to estimate accurately graphs with a degree 
D large compared ton/ (l + log(p/n)). We also prove that the size of the penalty 
involved in the selection procedure is minimal in some sense. 

The remaining of the paper is organized as follows. After introducing a few 
notations, we describe the estimation procedure in Section 2 and state our main 
results in Section 3. Section 4 is devoted to a small numerical study and Section 6 
to the proofs. 



^the degree of a graph corresponds to the maximum number of edges incident to a vertex. 
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A few notations 

Before describing our estimation procedure, we introduce a few notations about 
graphs we shall use all along the paper. 

a. Graphs 

The set of the graphs with p vertices labeled by {1, . . . ,p} is in bijection with 
the set Q of all the subset g of {1, . . . fulfilling 

• ihj) i 9 for all j e {1, . . . 

• G ff ^ e 9 for aU i,j G {1, . . . 

Indeed, to any g £ Q we can associate a graph with p vertices labeled by 
{1, . . . , p} by setting an edge between the vertices i and j if and only if (i, j) G g. 
For simplicity, we call henceforth "graph" any element g oi Q. 

For a graph g E Q and an integer j G {1, . . . we set gj = {i : G g} 

and denote by \gj\ the cardinality of gj. Finally, wc define the degree of g by 
deg(5) = max{|gj| : 

6. Directed graphs 

As before, we will represent the set of the directed graph with p vertices labeled 
by {1, . . . ,p} by the set of all the subset g of {1, . . . fulfilling (j, j) ^ g 
for all i G {1, ■ • ■ More precisely, we associate with g G the directed 
graph with p vertices labeled by {1, ... ,p} and with directed edges from i to j 
if and only if (i, j) G g. 

We note that Q C and we extend to 5 G the above definitions of g^, 
l^jl, and deg((7). Although G is contained in it should be noted that the 
associated interpretation is different since the graphs in G^ are directed with 
possibly two directed edges between two vertices. 

2. Estimation procedure 

In this section, we explain our procedure to estimate 9. We first introduce a 
collection of graphs and models, then we associate with each model an estimator 
and finally we give a procedure to select one of them. 

2.1. Collection of graphs and models 

Our estimation procedure starts with the choice of either a collection M <Z G 
of graphs or a collection M C G^ of directed graphs which are our candidates 
to describe the shape of 6. Among the possible choices for we mention four 
of them: 

1. the set C ^? of all graphs with at most D edges, 

2. the set M.'^^'^ C of all graphs with degree less than D, 
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3. the set Jvvj^' c of ah directed graphs with at most D directed edges, 

4. the set A^^''^'^ C of aU directed graphs with degree less than D. 

We call degree of M the integer Dm = max{deg(TO) m € Al} and note that 
the above collections of graphs have a degree bounded by D. 

To the collection of graphs A^, we associate the following collection {6m, 
m G A^} of models to estimate 6. The model 8m is the linear space of those 
matrices in W^^ whose shape is given by the graph m, namely 

Gm = {A e Rf^P : (^, j) ^ m ^ Ap' = O}. 

As mentioned before, we known that 0^^^ = if and only if 9^^^ = 0, so 
it seems irrelevant to (possibly) introduce directed graphs instead of graphs. 
Nevertheless, we must keep in mind that our aim is to estimate 6 at best in terms 
of the MSEP. In some cases, the results can be improved when using directed 
graphs instead of graphs, typically when for some i, j € {1, . . . ,p} the variance 
of 6'P^X« is large compared to the conditional variance Var(X^-'^ |X^'"'\ k ^ j), 
whereas the variance of efx'-i^ is small compared to Var(X(*)|X('=), k ^ i). 
Finally, we note the following inclusions for the families of models mentioned 
above 

U Om C y Gm C y Om C [j Om- 

meA^*-+ meM* meM^°'^ 7neM'j^'^- + 



2.2. Collection of estimators 

We assume henceforth that 2> < n < p and that the degree Dm of is upper 
bounded by some integer D < n — 2. We start with n observations Xi, . . . , X„ 
i.i.d. with law Pc and we denote by X the n x p matrix X = [Xi, . . . , Xn]^ . In 
the following, we write A^^\ . . . , A'-p^ for the p columns of a matrix A G K'^^p. 

We remind the reader that \\C^/'^{I - e)\\'^ = inf^ee \\C^^^iI - A)f, where 
is the space oi p x p matrices with on the diagonal. An empirical version of 
||C"'^/^(/— A)|p is — A)|j^xpj which can also be viewed as an empirical 

version of the loss ||C"'^/^(A — d)\\^ , since by Pythagorean theorem |jC^/^(A — 
e)f = !|Ci/2(/ _ A)f - ||Ci/2(/ _ 6')||2, for all A € 6. 

In this direction, we associate with any m G A^, an estimator 9m of 9 by 
minimizing on Om this empirical risk 

- ^m)||Lp = mm \\X{I - A)||Lp. (1) 
We note that the p x p matrix then fulfills the equalities 

X9l{^ = Pi-O-ixeH' (^^'^) ' for J = 1, . . . , p, 
where is the hnear space Qm = {9^^^ ■ 9 e 9m} C MP and Proj^Q(j) 

is the orthogonal projector onto XOin in M" (for the usual scalar product). 
Hence, since the covariance matrix C is positive definite and D is less than n, 
the minimizer of (1) is unique a.s. 
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2.3. Selection procedure 

To estimate 0, we will select one of the estimator dm by minimizing some penal- 
ized version of the empirical risk \\X{I — 6m)\\^/n. More precisely, we set 9 = Om 
where m is any minimizcr on Ai of the criterion 



Crit(m) 



p 

E 



(2) 



with the penalty function pen : N of the form of the penalties introduced 

in Baraud et al. [4]. To compute this penalty, we define for any integers d and 
N the Dkhi function by 



Dkh\{d,N,x) 



F, 



d+2,N 



> 



d + 2 



X 

d 



N+2 



> 



N + 2 
Nd 



X > 0, 



where Fd^N denotes a Fisher random variable with d and N degrees of freedom. 
The function x i— > Dkhi(c?, N, x) is decreasing and we write EDkhi[c?, N, x] for its 
inverse, see [4] Section 6.1 for details. Then, we fix some constant K > 1 and 
set 



pen((i) = K ■ 



EDkhi 



d+l,n-d-l,{C^_,{d+lfy 



(3) 



Size of the penalty 

The size of the penalty pen{d) is roughly 2Kd\ogp for large values of p. Indeed, 
we will work in the sequel with collections of models, such that 

TL 

F>M < 11 o , for some < 1, 

2(1.1 + V15^) 

and then, we approximately have for large values of p and n 

pen(d) <A' (l + eV21ogp)^(d+l), d G {0, . . . , D^} , 

see Proposition 4 in Baraud et al. [4] for an exact bound. In Section 3.2, we 
show that the size of this penalty is minimal in some sense. 



Choice of the tuning parameter K 

Increasing the value of K decreases the size of the graph rh that is selected. 
The choice K = 2 gives good control of the MSEP of 0, both theoretically and 
numerically (see Section 3 and 4). If we want that the rate of false discovery of 
edges remains smaller than 5%, the choice K ^ i may also be appropriated. 
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Computational cost 

The computational cost of the selection procedure appears to be very high. 
For example, if = A^^^'^ the computational complexity of the procedure 
increases as p^-^+i) with the dimension p. In a future work [12], we will propose 
a modified version of this procedure, which presents a much smaller complexity. 

A few additional remarks on the estimation procedure 
1- The matrix 6 belongs to the set 

r = {e* e W^P : 3 K positive definite such that e,,^ = -Ki^^/Kjj, for i ^ j} , 

but the estimator 6 has no reason to belong to this space. To avoid this unpleas- 
ant feature, it would be natural to minimize (1) on the space &„i H F instead 
of &rn- Unfortunately, we do not know how to handle this case neither theo- 
retically nor numerically. We also emphasize that the matrix 9 is not assumed 
to be exactly sparse, so it does not belong to any of the {9„i n F, m G M} in 
general. In particular, it is unclear whether the MSEP of the estimator obtained 
by minimizing (1) on 0,„ n F is smaller than the MSEP of 9m- 

2- In the special case where Ai = Ai'^^'^ , the minimization of (2) can be 
obtained by minimizing jjX^^^ — X9m\\^ x ^1 + ^°"^|™^|^ ^ independently for 

each j. This nice computational feature does not hold for the other collections 
of graphs introduced in Section 2.1. 

3. The main result 

Next theorem gives an upper-bound on the MSEP of a slight variation 9 oi 9, 
defined by 

= l{||eu)||<VpT„}' for all J e {1, . . . , with T„ = n^'^s". (4) 

We note that 9 and 9 coincide in practice since the threshold level r„ increases 
very fast with n, e.g. T20 « 6.10''. 

In the sequel, we write cr| = (Cj-j) = Var(X(j) | X^^^A: 7^ j) and define 
Om by 

\\C'^'(0-9m)f = min \\C'^'{9 - Am)W'. 

Theorem 1. Assume thatp > ?i > 3 and Dm = max{deg(m), m € M} fulfills 
the condition 

1 < Dm 5: V ^^^^1 /o'' some ry < 1. (5) 

2(l.i + VT^) 
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Then, the MSEP of the estimator 9 defined by (4) is upper bounded by 

pcn(dcg(m)) ' 



E 



\\C'/\e-9)r] < ciK,^) min\\\C'/'{9-9„,)f(l 



n — deg(?Ti) 

-^(pen(|m,|) + iflogn)a2 \+R„{tj,C) (6) 



n 



where K is the constant appearing in (3), c{K,r]) = (^J^_■^^•^^_^•^^ and the resid- 
ual term Rn{r],C) (made explicit in the proof) is of order ap^n~^'°s". 

The proof of Theorem 1 and of the next corollary is delayed to Section 6.3. 

Corollary 1. Assume thatp > n > 3 and that Condition (5) holds. Then, there 
exists some constant CK,ri, depending on K and rj only, such that 

MSEP(^) < Cif.^ log(p) X f min {MSEP(0„)} V _ (?)||2V ^^^^ (7) 

\m^M n J 

where i?„ = RniVj C!) is of order a 

Corollary 1 roughly states that when the candidate graphs have a degree 
smaller than ?7/(21ogp), the MSEP of 9 nearly achieves, up to a log(p) factor, 
the minimal MSEP of the collection of estimators {6'm, m € Al}. In particular, 
if g e 7W, the MSEP of 9 is upper-bounded by log(p) times the MSEP of ^g, 
which in turn is roughly upper bounded by deg(g) x ||C^/^(/ — log(p)/n. 

The additional term n~^||C^/^(/— in (7) can be interpreted as a minimal 
variance for the estimation of 9. This minimal variance is due to the inability of 
the procedure to detect with probability one whether an isolated vertex of g is 
isolated or not. We mention that when each vertex of the graph g is connected to 
at least one other vertex, this variance term n~^||C^/^(/ — 9)\\'^ remains smaller 
than the MSEP of 9^. 

Below, we discuss on the necessity of Condition (5) on the degree of the 
graphs and on the size of the penalty. 



3.1. Is Condition (5) avoidable? 

Condition (5) requires that Dm remains small compared to n/ (2 logp). We may 
wonder if this condition is necessary, or if we can hope to handle graphs with 
larger degree D. A glance at the proof of Theorem 1 shows that Condition (5) can 

be replaced by the weaker condition [^/Dm + 1 + \J'^^ogC^^ + l/(4C^^)j < 

rjn. Using the classical bound Cp_i < (ep/D)-^, we obtain that the latter con- 
dition is satisfied when 



(8) 
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SO we can replace Condition (5) by Condition (8) in Theorem 1. Let us check 
now that we cannot improve (up to a multiphcative constant) upon (8). 

Phythagorean equahty gives \\C^I^(e-e)\\'^ = \\C^/^{I~e)\\^ -\\C^/^{I~e)f , 
so there is no hope to control the size of \\C^/'^{9 — 9)\\'^ if we do not have for 
some 5 e (0, 1) the inequalities 

{1-6)\\C''\I-A)\\,., 
<^\\X{I-A)\U^p<{l + 5)\\C^/-\l^A)\\p^.p for all A e M 6™ (9) 

with large probability. Under Condition (5) or (8), Lemma 1 Section 6 ensures 
that these inequalities hold for any 5 > y/fj with probability 1 — 2exp(— r7,((5 — 
y/rj)'^ /2). We emphasize next that in the simple case where C = I, there exists 
a constant c{S) > (depending on 6 only) such that the Inequalities (9) cannot 
hold if M* C M or M*''^ C M with 

n 

D>c{S)- 

Indeed, when C ^ I and Mf) C M (or Mf,''^ C M), the Inequalities (9) 
enforces that n~^/^X satisfies the so-called (5-Restricted Isometry Property of 
order D introduced by Candcs and Tao ["i], namely 

(1 - (5)||/3||pxi < ||n-i/2jf/3||pxp < (1 + SmUi 

for all (3 in W with at most D non-zero components. Barabiuk et al. [2] (see 
also Cohen et al. [6]) have noticed that there exists some constant c{5) > 
(depending on 6 only) such that no n x p matrix can fulfill the (5-Restricted 
Isometry Property of order D if D > c{S)n/{l + log(p/n)). In particular, the 
matrix X cannot satisfies the Inequalities (9) when C A4 (or A^*'^ C A4) 
with D > c{S)n/{l + \og{p/n)). 

3.2. Can we choose a smaller penalty? 

As mentioned before, under Condition (5) the penalty pen(c?) given by (3) is 
approximately upper bounded by K (l + logp) (d+l). Similarly to The- 
orem 1 in Baraud et al. [4] , a slight variation of the proof of Theorem 1 enables 
to justify the use of a penalty of the form pen(d) = 2Kd\og{p — 1) with K > 1 
as long as Dm remains small (the condition on Dj^ is then much stronger 
than Condition (5)). We underline in this section, that it is not recommended 
to choose a smaller penalty. Indeed, next proposition shows that choosing a 
penalty of the form pen(c?) = 2(1 — 7)dlog(p — 1) for some 7 G (0, 1) leads to a 
strong overfitting in the simple case where 9 = 0, which corresponds to C = /. 

Proposition 1. Consider three integers 1 < D < n < p such thatp > e'^^^^^''^ + 
1 and M% d M or M%'^ C M. Assume that pcn((i) = 2(1 - ^)d\og{p - 1) for 
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some 7 G (0, 1) and 6 — 0. Then, there exists some constant 0(7) made explicit 
in the proof, such that when rh is selected according to (2) 



In addition, in the case where M = M.J^^ , we have 



> 1 - 3(p - 1)-^ - 2e-'''"/^' for all j € {1, . . . 



4. Numerical study 

In this section, we carry out a small simulation study to evaluate the perfor- 
mance of our procedure. Our study concerns the behaviour of the estimator 9 
when the sparsity decreases (Section 4.2) or when the number of covariates p 
increases (Section 4.3). In this direction, we fix the sample size n to 15 (a typ- 
ical value in post-genomics) and run simulations for different values of p and 
for different sparsity levels. For comparison, we include the procedure "or" of 
Meinshausen and Biihlmann [15]. This choice is based on the numerical study 
of Villers et al. [1>^], where this procedure achieves a good trade-off between the 
power and the FDR. We write henceforth "MB" to refer to this procedure. 



^.1. Simulation scheme 

The graphs g are sampled according to the Erdos-Renyi model: starting from a 
graph with p vertices and no edges, we set edges between each couple of vertices 
at random with probability q (independently of the others). Then, we associate 
with a graph g a positive-definite matrix K with shape given by g as follows. 
For each {i,j) € g, we draw Ki j = Kj^i from the uniform distribution in [—1, 1] 
and set the elements on the diagonal of K in such a way that K is diagonal 
dominant, and thus positive definite. Finally, we normalize K to have ones on 
the diagonal and set C ~ K^^ . 

For each value of p and q we sample 20 graphs and covariance matrices 
C . Then, for each covariance matrix C, we generate 200 independent samples 
{Xi, . . . , X15) of size 15 with law Pc. For each sample, we estimate 9 with our 
procedure and the procedure of Meinshausen and Biihlmann. For our proce- 
dure, we set M. = M'^^ and iiT = 2 or 2.5. For Meinshausen and Biihlmann's 
estimator 9ms we set A according to (9) in [ ] with a ~ 5%, as recommended 
by the authors. 

On the basis of the 20*200 simulations we evaluate the risk ratio 



rRid:- ^^"^^^^^ 

min^MSEP(^„)' 
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as well as the power and the FDR for the detection of the edges of the graph g. 
The calculations are made with R www.r-project.org/. 

4-2. Decreasing the sparsity 

To investigate the behaviour of the procedure when the sparsity decreases, we fix 
{n,p) = (15, 10) and consider the three graph-density levels q = 10%, q — 30% 
and q ~ 33%. The results are reported in Table 1. 

When q = 10% the procedures have a good performance. They detect on 
average more than 80% of the edges with a FDR lower than 5% and a risk ratio 
around 2.5. We note that MB has a slightly larger risk ratio than our procedure, 
but also a slightly smaller FDR. 

When q increases above 30% the performances of the procedures decline 
abruptly. They detect less than 25% of the edges on average and the risk ratio 
increases above 4. When q = 30% or q = 33% our procedure is more powerful 
than MB, with a risk ratio 33% smaller. 

In this simulation study, all the candidate graphs have a degree smaller than 
4. Using candidate graphs with a larger degree should not change the nature of 
the results. Actually, when g = 30 or 33%, less than 2% of the selected graphs 
have a degree equal to 4 and the mean degree of the selected graphs is between 
1 and 2. 

4-3. Increasing the number of covariates 

In this section, we focus on the quality of the estimation of 6 and g when the 
number of covariates p increases. We thus fix the sample size n to 15 and the 
sparsity index s pq to 1. This last index corresponds to the mean degree of 
a vertex in the Erdos-Renyi model. Then, we run simulations for three values 
of p, namely p = 15, p = 20 and p ~ 40 (in this last case we set A4 ~ A^g"*' to 
reduce the computational time). The results are reported in Table 2. 



Table 1 

Our procedure with K = 2, K = 2.5 and MB procedure: Risk ratio (r.Risk), Power and FDR 
when n = 15, p = 10 and q = 10%, 30% and 33%. 





q = 10% 


q = 30% 


q = 33% 


Estimator 


r.Risk 


Power 


FDR 


r.Risk 


Power 


FDR 


r.Risk 


Power 


FDR 


K = 2 


2.3 


82% 


4.9% 


4.3 


23% 


6.8% 


4.4 


13% 


5.6% 


K = 2.5 


2.5 


81% 


4.4% 


4.9 


20% 


5.4% 


4.9 


10% 


4.1% 


MB 


3.3 


81% 


3.7% 


6.9 


14% 


2.9% 


6.4 


3.8% 


1.1% 


Our procedure with K 


= 2, K = 

when n 


Table 2 

2.5 and MB procedure: Risk ratio (r.Risk) 
= 15, s = 1 and p = 15, 20 and 40. 


Power and FDR 




p = 15 


p = 20 


p = 40 


Estimator 


r.Risk 


Power 


FDR 


r.Risk 


Power 


FDR 


r.Risk 


Power 


FDR 


K = 2 


3.6 


74% 


6.6% 


3.7 


69% 


6% 


5.4 


68% 


5.4 % 


K = 2.5 


4.3 


72% 


6% 


4.4 


68% 


5.3% 


6.5 


67% 


4.7% 


MB 


17 


60% 


4% 


160 


20% 


4.8% 


340 


0.0% 


0.0% 
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When the number p of eovariates inereases, the risk ratios of the procedures 
increase and their power decrease. Nevertheless, the performance of our pro- 
cedure remains good, with a risk ratio between 3.6 and 6.5, a power close to 
70% and a FDR around 5.6 ± 1%. In contrast, the performances of MB decrease 
abruptly when p increases. For values oi p larger or equal to 22 (not shown), 
MB procedure does not detect any edge anymore. This phenomenon was already 
noticed in Villers et al. [18]. 

5. Conclusion 

In this paper, we propose to estimate the matrix of regression coefficients by 
minimizing some penalized empirical risk. The resulting estimator has some nice 
theoretical and practical properties. From a theoretical point of view. Theorem 1 
ensures that the MSEP of the estimator can be upper-bounded in terms of the 
minimum of the MSEP of the {9m, m & AA} in a non-asymptotic setting and 
with no condition on the covariance matrix C . From a more practical point 
of view, the simulations of the previous section exhibit a good behaviour of 
the estimator. The power and the risk of our procedure are better than those 
of the procedure of Meinshausen and Biihlmann, especially when p increases. 
The downside of this better power is a slightly higher FDR of our procedure 
compared to that of Meinshausen and Biihlmann. If the FDR should be reduced, 
we recommend to set the tuning parameter K to a larger value, e.g. K = 3. 

The main drawback of our procedure is its computational cost and in practice 
it cannot be used when p is larger than 50. In a future work [12], we propose a 
modification of the procedure that enables to handle much larger values of p. 

Finally, we emphasize that our procedure can only estimate accurately graphs 
with a degree smaller than n/ (2 \ogp) and as explained in Section 3.1, we cannot 
improve (up to a constant) on this condition. 

6. Proofs 

6.1. A concentration inequality 

Lemma 1. Consider three integers 1 < d < n < p, a collection Vi, . . . , Vn of 
d- dimensional linear subspaces of W and a n x p matrix Z whose coefficients 
are i.i.d. with standard gaussian distribution. We set \\ 

' \\n — II ' llrixl/v^ dild 



K{z) 



inf 

ueViU-.-uViv 



\\Zv\ 



n 



pxl 



Then, for any x > 




(10) 



where Af has a standard Gaussian distribution and ~ (A^v^SToglV) 



-1 
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Similarly, for any x>0 

pf sup J^>l + ^^^±^^+^" + ^'| <F(AA>x)<e-V^ 

(11) 

Proof. The map Z ^ (-^71 A^(Z)) is 1-Lipschitz, therefore the Gaussian concen- 
tration inequality enforces that 

P (AS(Z) < E {Ki{Z)) - x/^) < P (AA > x) < e-^''\ 

To get (10), we need to bound E (A^(Z)) from below. For i = 1, . . . , A^, we set 

Xi{Z) = mf . 

^ev, \\v\\ 

We get from [~] the bound 

hence there exists some standard Gaussian random variables Mi such that 

K {Z)>1- ^/dj^ - (M)+ /V", 

where (a;)+ denotes the positive part of x. Starting from Jensen's inequality, we 
have for any A > 



E 



( _max^(M)+) < ^ logE(e^'°^'^-=i- ■■•"(•^■)+) 
< -log^E 

\i=l 



< i logiV+i log(e^'/2 + l/2) 

logA^ A e-^'/2 

< — \ \ . 

A 2 2A 



Setting A = v21oglV, we finally get 

V21ogiV + ^jv 



;(AS(Z)) = e(. inin A,(Z) 



> 1 - 



This concludes the proof of (10) and the proof of (11) is similar. □ 
6.2. Proof of Corollary 1 

Corollary 1 is a direct consequence of Theorem 1 and of the three following 
facts. 
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1. The equality Y''j=i = \\C^''^{I - OW holds. 

2. Proposition 4 in Baraud et al. [4] ensures that when Dm fulfills Condi- 
tion (5) , there exists a constant C (K, rj) depending on K and rj only, such 
that 

^^^^ < CiK, ri) for all d<DM- 

n — a 

3. When Dm fulfills (5) the MSEP of the estimator 6',„ is bounded from 
below by 



E 



The latter inequality follows directly from Lemma 1. 

Finally, to give an idea of the size of C{K,1]), we mention the following 
approximate bound (for n and p large) 



pen(i^^) ^ K{l + e^V2l^) 
C{K, rf) = — < — ^ X 77 



n-D 



•m 



n - Dm 



2(1.1 + VT^)^ 



Kr] e 



2r, 



6.3. Proof of Theorem 1 



The proof is split into two parts. 

First, we bound from above E [|| C^/^ (6^ - 6I) || 2] by (l - E - e*) || ^ ] - 

Rn- Then, we bound this last term by the right hand side of (6). 

To keep formulas short, we write henceforth D for Dm- 
a. Prom ¥.[\\C^''^{e - 6»)f ] to E[||X(^ - e)\\l] . 



We set 



||nxl/v^, Ao = (1 - ^) , 



= and A:=infjHC3^:.e U 



|Ci/26»0")| 



\v\\ 



where = C^^^ < 9^^'' > +C'^^'^ei?,^ and M* jj is the set of those subsets m of 
{1, . . . , j — 1, j + 1, . . . ,p} X {j} with cardinality D. Then, for any 7 = 1, . . . ,p 



E 



= E 



-f E 



II^^^^^''''^II^1{A*>Ao, e(3)=0, Ai<3/2} 
II^^^^^''''^II^1{A*>Ao, 9U)=0, Ai>3/2} 



-E 



0) ,_^U) 



■E 
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We prove in the next paragraphs that Y!j=i^i^ - i^^ ^ ^)\\V\ and 

that the residual term Rn{-q, C) = Ej=i(E2^^ + + IE4 ') is of order p^T~'^. 
The proofs of these bounds bear the same flavor as the proof of Theorem 1 in 



Baraud [■>]. 

<A on . Si 



Upper bound on E^"*' . Since 



we have 
and therefore 

■ . (12) 



Upper bound on e[^\ All we need is to bound P(A* > Aq, 9'^^^ = 0, A] < 3/2) 
from above. Writing A^ for the smallest eigenvalue of C, we have on the event 

{a;>Ao} 



AoVA^ ■ 
Besides, for any m £ /A, 

X0« = Proj^^o, + a,£(-'")) 

with e^-') distributed as a standard Gaussian random variable in M". 
Therefore, on the event {A* > Ao, 9^^'> =0, A] < 3/2} wc have 

AoVA^ 

^ 1.5||CW j')|| +aj||£(-'')||,. 



AnVA 

As a consequence, 

p(a* > Ao, e^'^ -0, A] < 3/2) 

V An V A 



< 



< 



(2a,||£W||„ > Aov^r„) 



when 3 II Ci/26)0) II > Xoy^T„ 



else. 



9||CV20W||V(A2A-pr2) when 3 11^1/200") II > Xo^I^T^ 
31 



4ay{Xl\-pT^) else 



C. Giraud/ Estimation of Gaussian graphs 



557 



Finally, 



9 IIC^/26'(^')|P + 40-2 



(13) 



Upper hound on £3 • Wc note that ^t- (Aj) follows a distribution, with n 
degrees of freedom. Markov inequality then yields the bound 

P (A] > 3/2) < exp I (9/4 - 1 - log(9/4))) < exp(-ri/5). 

As a consequence, we have 

4^'^ < |lCi/26l«||2exp(-n/5). (14) 

Upper hound on 'e!'^ . Writing A+ for the largest eigenvalue of the covariance 
matrix C, we have 



< 21 



+ ^^^^^^^^ 



< 2 + A+pT,?) P (A* < Ao) . 

The random variable Z = XC~^^^ is nxp matrix whose coefficients are i.i.d. and 
have the standard Gaussian distribution. The condition (5) enforces the bound 



VDTT+ j2\og\Mlo\+SiMi,^ 



< Vv, 



so Lemma 1 ensures that 

P (A* < Ao) < exp i-n{l - ^])rj/2) 

and finally 

4'' < 2 + A+pT^) exp {-n{l - ^)v/2) 

Conclusion. Putting together the bounds (12) to (15), we obtain 



(15) 



E 



< A(7^E 



with i?„(7^,C) -n=i(E. 



E 



E'-f) of order B. p'^T- 



\\X{§-0)\\1\+R,,{tj,C) 
(16) 



2^-4 log ?l_ 



b. Upper bound on E 

Starting from the inequality 

p 



x{9~e)\\ 



5] ||X(^)-X^«|px 1 + 



Let m* be an arbitrary index in M. 

pcn{\rhj\y 



< J2 \\x^^^ -xe. 

3 = 1 \ 



||2 ^ / ^ Pcn(|m*|) ^ 
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and following the same lines as in the proof of Theorem 2 in Baraud et al. [4] 
we obtain for any K > 1 



K-l 

K 



p 



\X{e^'^ - + + ^ I - pcn(|„.,, 



0) 



where for any m e and j G { 1 , . . . , p} 

X^W =Proj^QO,(X0(^)), 



E 



<Pcn(K-|) 



n — m," 



a.s. 



and the two random variables Um] and Vm} are independent with a X^(l™il + 1) 
and a x^('t- ~ ^ 1) distribution respectively. Combining this bound with 
Lemma 6 in Baraud et al. [4], we get 



K- 1 



K 



\[\\x{e-e)\[ 



< 



E[\\X{0~e„,.)\\l] +£pen(|m*|) 



E[||X(gO-)-g2l)f1 



i=l nijEMj 



X Dkhi I Im^l + 1, n — \mj\ 



{n- \mj \ - l)pen(|mj-|) \ 



where A^^ = {mj, m e A^}. The choice (3) of the penalty ensures that the last 
term is upper bounded by A'^^^-^ cr| log(ri)/n. We also note that ||X(6'(-') — 

e'^^,)\\l < \\Xiei^)-el^l)\\l foraU.? e {l, . . . ,p} since X^Hi = Proj^^u, (X^^-)). 



Combining this inequality with E 
we obtain 

K ~ 1 



K 



< 



\\X{9- 



n — m" 



log. 



< r^/^(^-^™oip(i + ^)+E 



(pen(|mj|) + iflogn)^ (17) 
n 
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c. Conclusion. The bound (17) is true for any m*, so combined with (16) it 
gives (6). 

6.4- Proof of Proposition 1 

The proof of Proposition 1 is based on the following Lemma. 

Let us consider anxp random matrix Z whose coefficients Z^"*^ are i.i.d. with 
standard Gaussian distribution and a random variable e independant of Z, with 
standard Gaussian law in M". 

To any subset s of {1, . . . ,p} we associate the linear space Vs = spanjcj, j € 
s} C MP, where {ei, . . . , Cp} is the canonical basis of K.P. We write Z9s = 
Proj^y (e), we denote by Sd the set of cardinality d such that 

||Z0,J|2 = max||Z^,||2. (18) 

\s\=d 

and we define 

Crit'(.) = ||s-Z^.f fl+P^ 

Lemma 2. Assume that p > e^^^^^''^ and pcn{d) = 2(1 — ^)d\ogp. We write 
Dn,p for the largest integer smaller than 

5-D/6, -TTiK and , 

' (41ogp)3/2 512(1.1 + Vl^)^ 

Then, the probability to have 

Crit'(s) > Crit'(s£)^ ^) for all s with cardinality smaller than 7D„_p/6 

is bounded from below by 1 — Sp~^ — 2 exp(— n7^/512). 

The proof of this lemma is technical and in a first time we only give a sketch 
of it. For the details, we refer to Section 6.5. 

Sketch of the proof of Lemma 2. We have 

\\Z§sf = \\ef- inf \\e-Zaf 

d^Vs 

= sup [2 <e,Za> -\\Za\\^] . 

According to Lemma 1, when |s| is small compared to n/ \ogp, we have ||Zq;|P w 
n||Q;|p with large probability and then 

\\Z0sf « sup [2 < Z^e,a> -n\\af] = - \\PToiy {Z^s)f. 

aeVs 

Now, Z^e = \\e\\Y with Y independent of e and with J\f{Q,Ip) distribution, so 



2 



\Z9,r^^ WPvoiyY 



2 
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Since ma.yi\s\=d llProjy « 2d\ogp with large probability, we have 11^6'^^ 
2d\ogp X ||£|p/n and then 

mm Crit (s) ~ Crit [Sd) ~ e 1 

\s\=d \ n 

Therefore, with large probability we have Crit'(s) > Ciit' {sd„ for all s with 
cardinality less than "fDn,p/6. 

Proof of Proposition 1. We start with the case Mf^'^ C A^. When |m| < 
7D„^p_i/6, we have in particular \rhi\ < We build m from m 

by replacing ?fii by a set mi C {1} x {2,...,p} which maximizes 
among all the subset rhi of {1} x {2,...,p} with cardinality Dn,p~i- It fol- 
lows from Lemma 2 (with p replaced by p — 1) that the probability to have 
Crit(m) < Crit(m) is bounded from above by 3{p — 1)^^ + 2 exp(— n7^/512). 
Since rh £ M'fj^ , the first part of Proposition 1 follows. When C M, the 
proof is similar. 

When M'^^'^'^ C A^, the same argument shows that for any j e {1, . . . ,p} 
the probability to have \rhj \ < ^yDn^p-i/G is bounded from above by 3(p— 1)^^ + 
2exp(-?i7V512). ' □ 



6. 5. Proof of Lemma 2 

We write D for Dn.p and flo for the event 



\Z0sJ^>2D{l-j/2)\\e\\l\ogp and 

<2\s\ (2 + 7)||e||2 logp, for all s with \s\ < D 



We will prove first that on the event flo we have Crit'(s) > Crit'(s£)^p) for 
any s with cardinality less than 7-Dri.p/6 and then we will prove that flo has a 
probability bounded from below by 1 — 3p~^ — 2 exp(— n7^/512). 

We write A(s) = Crit'(s£)) — Crit'(s). Since we are interested in the sign of 
A(s), we will still write A(s) for any positive constant times A(s). We have 
on r^o 

A(s) / 21ogp, , , \ f pcn(D)\ 



We note that pen(|s|)/(?? — |s|) < pen(£')/(n — D). Multiplying by n/(21ogp) 
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we obtain 



A„ < (i-.)D(i + ^-^";:g'°'°^'- )-(i-./2)z^ 

- (1 - 7)l»l + (2 + t)I»I + (2 + 7)|.|22^ 

L» - 2(1 - 7/2)Dlogp + 2(2 + 7)|s| logp 



-(l-7/2)i?+(l + 27)|s|. 

When p > e^l'y^^t) and |s| < "fD/Q the first term on the riglit liand side is 
bounded from above by (1 — 7)£', then since 7 < 1 

A(s) < (1 + 27)7i:i/6- 715/2 < 0. 

We will now bound P(fi§) from above. We write Y = Z'^e/\\e\\ (with the 
convention that Y = when e = 0) and 



^<^<(l-7/2r"\ f„a„ac U^k[, 



^2 = I max ||Proj,^y|p > 2(1 -7/2)1/^1) logpl 



fls = < max <41ogp 
[1=1,.. ..p 

We first prove that fli O H il^ C Hq. Indeed, we have on fli n fl-z 
WZOsJl^ = max sup [2 < e,Za > -llZajpl 

\s\=D aeVs 

> max sup [2 < Z^e, a > -n(l - 7/2)"i/2||^|| 

> ii^^Z?&^max|lProj,rf 

> 2i?(l-7/2)||£||2logp. 

Similarly, on Oi we have < ||£||^||Projy^rf x (2 + 7)/2 for all s with 

cardinality less than D. Since ||Projy^y|p < |s| max,i=i_..._p(Fj^), we have on 

\\Z§,\\' <2{2 + j)\.s\\\s\\l\ogp, 

for all s with cardinality less than D and then fii n n f^a C r^o- 

To conclude, we bound P(ri^) from above, for i = 1, 2, 3. First, we have 



r{ni) = P max Y,^ > Alogp < 2pP(Fi > 2y/log(pj) < 2p-^. 
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To bound P(fif), wc note that {l--f/2)-^^^ > l+7/8and ^/2/{2 + j) < I-7/8 
for any < 7 < 1, so Lemma 1 ensures that P{ni) < 2e-"'''/5i^ Finally, to 
bound P(r22); "^"^ sort the in decreasing order Y^-^^ > y^j) > • • • > Y^p-^ and 



note that 



inax\\ProiyYf>DY^%. 

\s\=D 



Furthermore, we have 

P(y,i,, <2(l-7/2)"=logp) < (''^'V^jf <2(l-7/2)"2logp)'"''*' 



P-D+1 



4(1-7/2)1/4^/^1^ 



where the last inequality follows from p > e^l '''' and Inequality (60) in Baraud 
et al. [4]. Finally, we obtain 



(y^I^ < 2(1 - 7/2)1/2 i^g^^ < p-i iQgp _ 



(p - £) + l)pVi-T/2 ^ 
-7/2)i/4^/2l5i^^ 



where the last inequality comes from D < p''/'i/(41ogp)3/2. To conclude P(r2|) < 
p-i and P(f}g) < 3p-i + 2exp(-n77512). 
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